[GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling #12042

Merged

jinchengchenghh merged 2 commits into apache:main from Yao-MR:feature/optimize-parquet-metadata-validation-sampling

May 11, 2026

Contributor

Yao-MR commented May 6, 2026

What changes are proposed in this pull request?

When a table has many partitions, the metadata validation checks every root path with fileLimit files each, resulting in excessive I/O cost.

This patch introduces a sampling mechanism that selects a percentage of root paths for validation instead of checking all of them. The file limit is distributed evenly across the sampled paths.

Key changes:

- Add config `spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage` with default value 0.1 (10% sampling)
- Use evenly spaced interval sampling for good partition coverage
- Add unit tests for the sampling logic
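
To illustrate the idea, here is a minimal sketch of evenly spaced sampling over root paths, with the file limit split across the sampled paths. It is not the code from this patch: the class `RootPathSampler`, the methods `sample` and `filesPerPath`, the rounding rule (round up, keep at least one path), and the floor of one file per path are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Gluten: shows one way to sample root paths.
final class RootPathSampler {
    private RootPathSampler() {}

    /** Picks roughly `percentage` of the paths at evenly spaced indices. */
    static <T> List<T> sample(List<T> rootPaths, double percentage) {
        if (percentage <= 0.0 || percentage > 1.0) {
            throw new IllegalArgumentException("percentage must be in (0, 1]");
        }
        int n = rootPaths.size();
        if (n == 0) {
            return new ArrayList<>();
        }
        // Assumption: round up and always keep at least one path.
        int k = Math.min(n, Math.max(1, (int) Math.ceil(n * percentage)));
        List<T> sampled = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Evenly spaced index floor(i * n / k); distinct for each i because k <= n.
            sampled.add(rootPaths.get((int) ((long) i * n / k)));
        }
        return sampled;
    }

    /** Splits the overall file limit evenly across the sampled paths. */
    static int filesPerPath(int fileLimit, int sampledCount) {
        if (sampledCount <= 0) {
            return 0;
        }
        // Assumption: every sampled path gets at least one file checked.
        return Math.max(1, fileLimit / sampledCount);
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            paths.add("/warehouse/t/dt=" + i); // made-up partition paths
        }
        List<String> sampled = sample(paths, 0.1);
        System.out.println(sampled.size() + " paths sampled, "
            + filesPerPath(10, sampled.size()) + " file(s) each");
    }
}
```

Taking indices at a fixed stride, rather than the first N paths, spreads the checked partitions across the whole listing, which is the coverage property the second bullet refers to.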

How was this patch tested?

Existing tests in ParquetEncryptionDetectionSuite continue to pass without modification.

Was this patch authored or co-authored using generative AI tooling?

ISSUE: #11782

github-actions Bot added CORE VELOX DOCS labels

github-actions Bot commented May 6, 2026

Run Gluten Clickhouse CI on x86

Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 706c87f to 970fb06 Compare

May 6, 2026 06:09

Yao-MR marked this pull request as ready for review

May 6, 2026 09:40

Yao-MR changed the title ~~[GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling…~~ [GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling

Yao-MR closed this

Yao-MR reopened this

Yao-MR closed this

Yao-MR reopened this

Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 51b280f to e99a01e Compare

May 7, 2026 01:43

Contributor Author

Yao-MR commented May 7, 2026

Hi @jinchengchenghh, I have added an optimized, sampling-based validation that should improve performance.
Could you review it when you have a chance?
Thanks.

Yao-MR closed this

Yao-MR reopened this

Yao-MR closed this

Yao-MR reopened this

Yao-MR closed this

Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 8186b6c to 93de3c8 Compare

May 7, 2026 12:10

Yao-MR reopened this

Yao-MR marked this pull request as draft

May 7, 2026 12:19

github-actions Bot added CLICKHOUSE and removed CORE VELOX DOCS labels

Yao-MR marked this pull request as ready for review

May 7, 2026 12:26

Yao-MR marked this pull request as draft

May 7, 2026 12:26

Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 1b65c49 to e99a01e Compare

May 7, 2026 12:27

github-actions Bot added CORE VELOX DOCS and removed CLICKHOUSE labels


[GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling root paths

f2334a0

When a table has many partitions, the metadata validation checks every
root path with `fileLimit` files each, resulting in excessive I/O cost.

This patch introduces a sampling mechanism that selects a percentage of
root paths for validation instead of checking all of them. The file limit
is distributed evenly across the sampled paths.

Key changes:
- Add config `spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage`
  with default value 0.1 (10% sampling)
- Use evenly spaced interval sampling for good partition coverage
- Add unit tests for the sampling logic

Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 314d4b7 to f2334a0 Compare

May 8, 2026 13:40


Merge branch 'main' into feature/optimize-parquet-metadata-validation-sampling

4cd7950

Yao-MR marked this pull request as ready for review

May 9, 2026 01:58

Contributor Author

Yao-MR commented May 11, 2026

Hi @jinchengchenghh, as discussed in #11782, I have implemented the sampling module, which should improve validation performance. Could you review it when you have a chance? Thanks!

jinchengchenghh approved these changes

Contributor

jinchengchenghh left a comment

Thanks!

jinchengchenghh merged commit 6298aeb into apache:main

60 checks passed

Labels

CORE DOCS VELOX